9 research outputs found

Wasserstein Introspective Neural Networks

Authors: Fan Fan, Lee Kwonjoon, Tu Zhuowen, Xu Weijian
Publication date: 07/04/2018

We present Wasserstein introspective neural networks (WINN) that are both a generator and a discriminator within a single model. WINN provides a significant improvement over the recent introspective neural networks (INN) method by enhancing INN's generative modeling capability. WINN has three interesting properties: (1) A mathematical connection between the formulation of the INN algorithm and that of Wasserstein generative adversarial networks (WGAN) is made. (2) The explicit adoption of the Wasserstein distance into INN results in a large enhancement to INN, achieving compelling results even with a single classifier, e.g., providing a nearly 20-fold reduction in model size over INN for unsupervised generative modeling. (3) When applied to supervised classification, WINN also gives rise to improved robustness against adversarial examples in terms of the error reduction. In the experiments, we report encouraging results on unsupervised learning problems including texture, face, and object modeling, as well as a supervised classification task against adversarial attacks. Comment: Accepted to CVPR 2018 (Oral).

arXiv.org e-Print Archive

Crossref

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

Authors: Lee Kwonjoon, Misu Teruhisa, Wang Xin Eric, Zhou Kaiwen
Publication date: 09/10/2023

In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) for visual commonsense reasoning (VCR). We categorize the problem of VCR into visual commonsense understanding (VCU) and visual commonsense inference (VCI). For VCU, which involves perceiving the literal visual content, pre-trained VLMs exhibit strong cross-dataset generalization. On the other hand, in VCI, where the goal is to infer conclusions beyond image content, VLMs face difficulties. We find that a baseline where VLMs provide perception results (image captions) to LLMs leads to improved performance on VCI. However, we identify a challenge with VLMs' passive perception, which often misses crucial context information, leading to incorrect or uncertain reasoning by LLMs. To mitigate this issue, we suggest a collaborative approach where LLMs, when uncertain about their reasoning, actively direct VLMs to concentrate on and gather relevant visual elements to support potential commonsense inferences. In our method, named ViCor, pre-trained LLMs serve as problem classifiers to analyze the problem category, as VLM commanders to leverage VLMs differently based on the problem classification, and as visual commonsense reasoners to answer the question. The VLMs perform visual recognition and understanding. We evaluate our framework on two VCR benchmark datasets and outperform all other methods that do not require in-domain supervised fine-tuning.

arXiv.org e-Print Archive

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Authors: Agarwal Nakul, Do Minh Quan, Fu Changcheng, Lee Kwonjoon, Sun Chen, Wang Shijie, Zhang Ce, Zhao Qi
Publication date: 09/10/2023

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after his/her current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives. They can provide prior knowledge on the possible next actions, and infer the goal given the observed part of a procedure, respectively. To leverage the LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks, and qualitative analysis shows that it can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGP

arXiv.org e-Print Archive

Learning Generative Models with Energy-Based Models and Transformer GANs

Author: Lee Kwonjoon
Publication venue: eScholarship, University of California
Publication date: 01/01/2022

In this thesis, we study approaches to learn priors on data (i.e. generative modeling) and learners (i.e. meta-learning) for computer vision tasks. We present our approaches to improve the stability and performance of generative modeling and meta-learning methods. First, we study the use of a natural image prior on computer vision tasks. To this end, we introduce a suite of regularization techniques that enhances the performance of energy-based models on realistic image datasets. On generative modeling, we achieve competitive results while using much smaller models. On supervised classification, we observe a significant error reduction against adversarial examples. Our model is the first computer vision model to achieve state-of-the-art image generation and classification within a single model. Next, we investigate whether a natural image prior can be learned with fewer vision-specific inductive biases. To this end, we integrate the Vision Transformer architecture into generative adversarial networks (GAN). We propose novel regularization methods and architectural choices to achieve this goal. The resulting approach, named ViTGAN, achieves comparable performance to the leading CNN-based GAN models on popular image generation benchmarks for the first time. Lastly, we study a meta-learning approach, which automatically extracts prior knowledge from the set of observed tasks. We present our work on improving the computational complexity of meta-learning. Our approach, named MetaOptNet, offers better few-shot generalization at a modest increase in computational overhead.

eScholarship - University of California

A 0.8-V 82.9-$\mu$W In-Ear BCI Controller IC With 8.8 PEF EEG Instrumentation Amplifier and Wireless BAN Transceiver

Authors: Hoi-Jun Yoo, Jaeeun Jang, Jaehyuk Lee, Ji-Hoon Kim, Kwonjoon Lee, Kyoung-Rog Lee, Surin Gweon, Unsoo Ha
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref